
The Illustrated Guide to Duplicate Content in the Search Engines

If there’s one issue that causes more contention, heartache and consulting time than any other (at least, recently), it’s duplicate content. This scourge of the modern search engine has its origins in the fairly benign realm of standard licensing and the occasional act of plagiarism. Over the last five years, however, spammers in desperate need of content began the now much-reviled practice of scraping content from legitimate sources, scrambling the words (through any number of text-manipulation processes) and re-purposing the text to appear on their own pages, in the hopes of attracting long-tail searches and serving contextual ads (and for various other nefarious purposes).

Thus, today, we’re faced with a world of “duplicate content issues” and “duplicate content penalties.” Luckily, my trusty illustrated Googlebot and I are here to help eliminate some of the confusion. But, before we get to the pretty pictures, we need some definitions:

  • Unique Content – written by humans, completely different from any other combination of letters, symbols or words on the web and clearly not manipulated through computer text-processing algorithms (like those crazy Markov-chain-employing spam tools).
  • Snippets – small chunks of content like quotes that are copied and re-used; these are almost never problematic for search engines, especially when included in a larger document with plenty of unique content.
  • Duplicate Content Issues – I typically use this phrase to refer to duplicate content that is not in danger of getting a website penalized, but rather is simply a copy of an existing page that forces the search engines to choose which version to display in the index.
  • Duplicate Content Penalty – When I refer to “penalties,” I’m specifically talking about things the search engines do that are worse than simply removing a page from their index.

Now, let’s look at the process for Google as it finds duplicate content on the web. In the examples below, I’m making a few assumptions:

  1. The page with text is assumed to be a page containing duplicate content (not just a snippet, despite the illustration).
  2. Each page of duplicate content is presumed to be on a separate domain.
  3. The steps below have been simplified to keep the process as clear as possible. This is almost certainly not the exact way in which Google operates (but it conveys the effect quite nicely).
[Illustrations: Duplicate Content, Phases I–IV]
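
To make the effect of those phases concrete, here’s a minimal sketch in Python. To be clear, this is not Google’s actual algorithm; the URLs, the authority scores and the fingerprint-then-pick-one logic are all illustrative assumptions of mine. It simply shows how several copies of a page collapse down to one displayed version, with the rest filtered out.

    # A toy model of the phases above, NOT Google's real process.
    # Pages with identical content are grouped by a fingerprint, and
    # the most authoritative copy in each group is kept for display.
    import hashlib

    # Hypothetical crawled pages: (url, page text, made-up link authority)
    pages = [
        ("http://original-site.com/article", "the full article text", 85),
        ("http://scraper-one.com/copy",      "the full article text", 12),
        ("http://scraper-two.com/copy",      "the full article text",  3),
    ]

    def fingerprint(text):
        """Reduce page text to a fingerprint so exact copies collide."""
        return hashlib.md5(text.lower().encode("utf-8")).hexdigest()

    # Group pages that share a fingerprint.
    clusters = {}
    for url, text, authority in pages:
        clusters.setdefault(fingerprint(text), []).append((url, authority))

    # Within each duplicate cluster, display one version; filter the rest.
    for candidates in clusters.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
        chosen, filtered = candidates[0], candidates[1:]
        print("displayed:", chosen[0])
        for url, _ in filtered:
            print("filtered: ", url)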

There are a few additional subjects about duplicate content that bear mentioning. Many of these trip up webmasters new to the dup content issue, and it’s sad that the engines themselves have no formal, disclosed guidelines (although I suppose that does give folks like me a day job). I’ve written these out as I most often hear them on the phone and see them in the forums:

Code to Text Ratio: What if my code is huge and the unique HTML elements on the page are very few? Will Google think my pages are all duplicates of one another?

Nope. As Vanessa clearly mentioned in our video together from Chicago, Google doesn’t give a hoot about your code; they’re interested in the content on your page.
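
As a quick illustration of that point, here’s a small Python sketch (standard library only; the sample page is my own invention, and real search engine parsers are certainly far more involved). Once the markup is stripped away, even a very code-heavy page boils down to a short chunk of visible text, and that text is what gets compared:

    # Strip markup from a page, keeping only the visible text.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible text, skipping <script> and <style> contents."""
        def __init__(self):
            super().__init__()
            self.chunks = []
            self.skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self.skip_depth += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self.skip_depth:
                self.skip_depth -= 1

        def handle_data(self, data):
            if not self.skip_depth and data.strip():
                self.chunks.append(data.strip())

    # A bloated page: lots of code, very little actual content.
    bloated_page = (
        "<html><head><title>Widgets</title>"
        "<style>body { margin: 0; }</style></head>"
        "<body><div><div><p>Our unique widget story.</p></div></div>"
        "</body></html>"
    )

    extractor = TextExtractor()
    extractor.feed(bloated_page)
    print(" ".join(extractor.chunks))  # the text the engines compare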

Navigation Elements to Unique Content Ratio: Every page on my site has a huge navbar, lots of header and footer items, but only a little bit of content; will Google think these pages are duplicates?

Nope. Google (and Yahoo! and MSN) have been around the block a few times. They’re very familiar with the layout of websites and recognize that permanent structures on all (or many) of a site’s pages are quite normal. Instead, they’ll pay attention to the “unique” portions of each page and will often largely ignore the rest.
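
Here’s a hedged sketch of that idea in Python (nothing below is any engine’s disclosed method, and the page blocks are made up): anything repeated on every page can be treated as template and set aside, leaving the unique portion behind.

    # Toy boilerplate detection: blocks that appear on every page are
    # presumed to be template (navbar, footer) and ignored; whatever
    # remains is the "unique" content that actually gets weighed.
    pages = {
        "/about":   ["SITE NAVBAR", "About our company...", "SITE FOOTER"],
        "/contact": ["SITE NAVBAR", "How to reach us...",   "SITE FOOTER"],
        "/blog/1":  ["SITE NAVBAR", "Today's post text...", "SITE FOOTER"],
    }

    # Blocks common to every page are treated as template.
    template = set.intersection(*(set(blocks) for blocks in pages.values()))

    for url, blocks in pages.items():
        unique = [block for block in blocks if block not in template]
        print(url, "->", unique)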

Licensed Content: What should I do if I want to avoid dup content problems, but have licensed content from other web sources to show my visitors?

Use <meta name="robots" content="noindex, follow"> – place this tag in your page’s <head> and the search engines will know that the content isn’t for them. It’s best to do it this way (in my opinion), because then humans can still visit the page, link to it, and the links on the page will still carry value.

Content Thieves: How should I deal with sites I find that are copying my content?

If the pages of these sites are in the supplemental index or rank far behind your own pages for any relevant queries, my policy is generally to ignore it. If we tried to fight all the copies of SEOmoz content on the web, we’d have at least two 40-hour-per-week jobs on our hands. Luckily, this is the only domain publishing our content that has enough link strength to rank well for it, and the search engines have placed trust in SEOmoz to produce high-quality, relevant, worthy content.

If, on the other hand, you’re a relatively new site, or a site with few inbound links, and the scrapers are consistently ranking ahead of you (or someone with a powerful site is stealing your work), you’ve got some recourse. One option is to file a DMCA infringement request with Google, with Yahoo!, and with MSN. The other is to file suit (or threaten to) against the website in question. If the site re-publishing your work has an owner in your country, this latter course of action is probably the wisest first step (I always try to be friendly before I send a letter from the attorneys), as DMCA requests can take months to go into effect.

Percentage of Duplicate Content: What percent of a page has to be duplicate before I run into dup content penalties and issues?

22.45%. No, seriously, the search engines would never reveal this information because it would compromise their ability to prevent the problem. It’s also a near-certainty that the threshold at each engine fluctuates regularly, and that more than simple direct comparison goes into dup content detection. If you really need the answer to this question, chances are you plan to do something blackhat with it.
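
For the curious, one well-known way to estimate overlap is shingle comparison: break each page into overlapping word sequences (“shingles”) and measure how much the two sets share. I’m not claiming any engine uses exactly this; the sample texts and the three-word shingle size below are my own choices.

    # Estimate the percentage of duplication between two texts using
    # shingle comparison. A textbook technique, not any engine's
    # disclosed method; their thresholds and details are secret.
    def shingles(text, k=3):
        """Break text into overlapping k-word chunks ("shingles")."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def overlap_percent(text_a, text_b):
        """Jaccard similarity of the two shingle sets, as a percentage."""
        a, b = shingles(text_a), shingles(text_b)
        if not (a or b):
            return 0.0
        return 100.0 * len(a & b) / len(a | b)

    original = "the quick brown fox jumps over the lazy dog every day"
    scraped  = "the quick brown fox jumps over a sleepy dog every day"
    print("%.1f%% shingle overlap" % overlap_percent(original, scraped))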

Issues vs. Penalties: How do I know if I’m being penalized for having duplicate content, rather than simply having my pages removed from the index (or put in supplemental)?

Penalties require a good bit of abuse to go into effect, but I’ve seen it happen, even on domains from respectable brands. The penalties really arise when you start copying hundreds or thousands of pages from other domains and don’t have a considerable amount of unique content of your own. It’s particularly dangerous with new sites or those that have recently changed ownership. However, whether you’ve got penalties or just find lots of your pages in supplemental hell, I highly recommend fixing the issue as I’ve described above.

What are your thoughts on dup content issues? Anything I’ve neglected or confused?

p.s. Googlebot got a nice upgrade courtesy of my improving illustration skills. I was feeling bad for the poor guy, despite the fact that it’s 2:15am and I have conference calls starting at 9am tomorrow.
